Data Set Overview

This dataset contains audio statistics for the top 2000 tracks on Spotify from 1998 to 2020. The dataset was retrieved from Kaggle. It includes 18 columns, each describing the track and its qualities. We are particularly interested in the genre of the track, valence, tempo, loudness, mode, key, energy, danceability, year, song title, and artist.

Data Set Objectives

We want to review the following objectives:

  1. Define data and key information
  2. Stratify by Genres, Key, Explicitness
  3. Review trends and the data over time
  4. Song Attributes over time, Stratify by Pop
  5. Distribution of Song Lengths
  6. Central Limit Theorem application on Songs
  7. Sampling of Songs by Genre

Key information regarding Data

artist: Name of the Artist.

song: Name of the Track.

duration_ms: Duration of the track in milliseconds.

explicit: The lyrics or content of a song or a music video contain one or more of the criteria which could be considered offensive or unsuitable for children.

year: Release Year of the track.

popularity: The higher the value the more popular the song is.

danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
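The integer-to-pitch mapping described above can be sketched as a small lookup. This is only an illustration: `pitch_names` and `key_to_pitch` are hypothetical names that do not appear in the original analysis, and the "N/A" label for undetected keys is an assumption borrowed from the Key Findings section.

```r
# Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B
pitch_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: translate key integers into pitch names.
# Anything outside 0..11 (including -1, "no key detected") becomes "N/A".
key_to_pitch <- function(key) {
  ifelse(key %in% 0:11, pitch_names[pmax(key, 0) + 1], "N/A")
}

key_to_pitch(c(0, 1, 10, -1))
```

Applied to a column such as `SpotifyTopHits$key`, a helper like this would give readable labels for the key chart.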

loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

genre: Genre of the track.

Genres

What genres were included in the top hits of songs from 1998-2020?

library(stringr)  # str_split()
library(plotly)   # plot_ly(), add_trace(), layout(), %>%

# A track can list several comma-separated genres; split them into one label per entry
SpotifyTopHits_genre <- trimws(unlist(str_split(SpotifyTopHits$genre, ",")))
# Count the entries per genre once and reuse the table for every bar
genre_counts <- table(SpotifyTopHits_genre)

# Horizontal bar chart: one trace per genre, with the song count on the x axis
p <- plot_ly(x = genre_counts["blues"], type = "bar", name = 'Blues')  %>%
  layout(
    title = "Song Genres of Spotify Top Hits",
    xaxis = list(title = "Song Count"),
    yaxis = list(title = "Genres")
  )
p <- add_trace(p, x = ~genre_counts["classical"], name = 'Classical')
p <- add_trace(p, x = ~genre_counts["country"], name = 'Country')
p <- add_trace(p, x = ~genre_counts["Dance/Electronic"], name = 'Dance/Electronic')
p <- add_trace(p, x = ~genre_counts["Folk/Acoustic"], name = 'Folk/Acoustic')
p <- add_trace(p, x = ~genre_counts["hip hop"], name = 'hip hop')
p <- add_trace(p, x = ~genre_counts["jazz"], name = 'jazz')
p <- add_trace(p, x = ~genre_counts["latin"], name = 'latin')
p <- add_trace(p, x = ~genre_counts["metal"], name = 'metal')
p <- add_trace(p, x = ~genre_counts["pop"], name = 'pop')
p <- add_trace(p, x = ~genre_counts["R&B"], name = 'R&B')
p <- add_trace(p, x = ~genre_counts["rock"], name = 'rock')
p <- add_trace(p, x = ~genre_counts["set"], name = 'set')
p <- add_trace(p, x = ~genre_counts["World/Traditional"], name = 'World/Traditional')
p

Genre Findings

Pop music dominates all genres in the Top Spotify Hits data set. This is rather unsurprising since pop music has catchy rhythms that make us want to sing along and dance. The lyrics usually repeat themselves, which makes them easy to remember. Pop music also typically revolves around the same themes and topics, which makes it easy to enjoy. We will take a closer look and isolate pop music to see if there is a strong correlation between the attributes.

Song Keys

Key Findings

The largest number of songs are in the C major key, followed by A#/Bb, and then songs that are not categorized in a particular key. Coming in closely behind N/A are songs in the F#/Gb key. It seems that since 1998, the fewest Spotify Top Hit songs are in the D key. An interesting fact: D major is typically described as the key of triumph, war cries, and victories, which would make sense, as songs about these themes could be less listened to.

Song Explicitness

Song Explicitness Findings

Out of the 2,000 songs in the dataset, 1,449 were not explicit while 551 were labeled explicit. Explicit songs can still reach the Spotify Top Hits because this data set is not limited to a certain age group.

Spotify Top Hits by Tempo from 1998 - 2020

Tempo Findings

Based on the aggregated data from 1998 - 2020, most songs had a BPM of 128.

Spotify Top Hits by average Tempo from 1998 - 2020

Song Tempo Findings

It appears that from 1998 to 2020, the average tempo (BPM) has increased over time. This looks like a steady increase and supports our previous finding that most songs fall in the 128 BPM range.
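A yearly average of this kind can be computed with `aggregate()`, and a fitted line gives a rough figure for the change per year. The sketch below uses a tiny made-up data frame (`toy`) in place of the real data, so the numbers are illustrative only and are not the values behind the chart above.

```r
# Made-up stand-in for the real data: two tracks per year
toy <- data.frame(
  year  = c(1998, 1998, 2010, 2010, 2020, 2020),
  tempo = c(100, 110, 118, 122, 126, 130)
)

# Mean tempo for each release year
avg_tempo <- aggregate(tempo ~ year, data = toy, FUN = mean)

# Slope of a least-squares line: average BPM change per year
trend <- lm(tempo ~ year, data = avg_tempo)
bpm_per_year <- coef(trend)[["year"]]

avg_tempo
bpm_per_year
```

A positive slope would be consistent with the upward trend described above; with the real data, `toy` would be replaced by the `year` and `tempo` columns of the data set.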

Song Attributes over the Years

Song Attributes

Let’s take a closer look at the Song Attributes

As mentioned above, are we able to see any correlations between the attributes of a song?

## Top 3 for attributes: duration_ms 
## Top Correlated Attributes: explicitness speechiness popularity 
## Top Correlation Values: 0.12 0.07 0.05 
## 
## Top 3 for attributes: year 
## Top Correlated Attributes: tempo explicitness danceability 
## Top Correlation Values: 0.08 0.08 0.03 
## 
## Top 3 for attributes: popularity 
## Top Correlated Attributes: duration_ms explicitness loudness 
## Top Correlation Values: 0.05 0.05 0.03 
## 
## Top 3 for attributes: danceability 
## Top Correlated Attributes: valence explicitness speechiness 
## Top Correlation Values: 0.4 0.25 0.15 
## 
## Top 3 for attributes: energy 
## Top Correlated Attributes: loudness valence liveness 
## Top Correlation Values: 0.65 0.33 0.16 
## 
## Top 3 for attributes: key 
## Top Correlated Attributes: valence danceability year 
## Top Correlation Values: 0.04 0.03 0.01 
## 
## Top 3 for attributes: loudness 
## Top Correlated Attributes: energy valence liveness 
## Top Correlation Values: 0.65 0.23 0.1 
## 
## Top 3 for attributes: mode 
## Top Correlated Attributes: tempo explicitness liveness 
## Top Correlation Values: 0.05 0.05 0.03 
## 
## Top 3 for attributes: speechiness 
## Top Correlated Attributes: explicitness danceability duration_ms 
## Top Correlation Values: 0.42 0.15 0.07 
## 
## Top 3 for attributes: acousticness 
## Top Correlated Attributes: year popularity duration_ms 
## Top Correlation Values: 0.03 0.02 0.01 
## 
## Top 3 for attributes: instrumentalness 
## Top Correlated Attributes: energy tempo danceability 
## Top Correlation Values: 0.04 0.03 0.02 
## 
## Top 3 for attributes: liveness 
## Top Correlated Attributes: energy loudness speechiness 
## Top Correlation Values: 0.16 0.1 0.06 
## 
## Top 3 for attributes: valence 
## Top Correlated Attributes: danceability energy loudness 
## Top Correlation Values: 0.4 0.33 0.23 
## 
## Top 3 for attributes: tempo 
## Top Correlated Attributes: energy year loudness 
## Top Correlation Values: 0.15 0.08 0.08 
## 
## Top 3 for attributes: explicitness 
## Top Correlated Attributes: speechiness danceability duration_ms 
## Top Correlation Values: 0.42 0.25 0.12

Song Attributes Findings

In the three figures above, songs were analyzed to see if there is any correlation between the attributes that were calculated. Interestingly enough, there are a few correlations to note that could provide some insight into the data.

Loudness vs. Energy - loudness is measured in decibels, while energy represents a perceptual measure of intensity and activity. The two attributes seem to have a fairly strong positive correlation at +0.65

Acousticness vs Energy - the two attributes seem to have a weak negative correlation at -0.45

Danceability vs Valence - the two attributes seem to have a weak positive correlation at +0.4

Explicitness vs Speechiness - the two attributes seem to have a weak positive correlation at +0.42
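A "top 3 correlated attributes" listing like the one above can be produced from a correlation matrix. The following is a minimal sketch on simulated data, not the code used for the report: `toy` and `top_correlated` are hypothetical, and the helper ranks by signed correlation, so a strong negative pair would only surface if the ranking used `abs()` instead.

```r
set.seed(1)
n <- 200
energy <- runif(n)

# Simulated attributes; loudness is built to track energy
toy <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 2),
  valence  = runif(n),
  tempo    = rnorm(n, mean = 120, sd = 25)
)

# Pairwise Pearson correlations; blank the diagonal so an
# attribute is never reported as correlated with itself
cors <- cor(toy)
diag(cors) <- NA

# Hypothetical helper: the k largest correlations for one attribute
top_correlated <- function(cors, attribute, k = 3) {
  r <- sort(cors[attribute, ], decreasing = TRUE, na.last = NA)
  round(head(r, k), 2)
}

top_correlated(cors, "energy")
```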

What if we only looked at Pop songs? How will that impact Song attributes?

## Pop Top 3 for attributes: duration_ms 
## Pop Top Correlated Attributes: liveness acousticness explicitness 
## Pop Top Correlation Values: 0.08 0.07 0.07 
## 
## Pop Top 3 for attributes: year 
## Pop Top Correlated Attributes: speechiness explicitness loudness 
## Pop Top Correlation Values: 0.2 0.15 0.04 
## 
## Pop Top 3 for attributes: popularity 
## Pop Top Correlated Attributes: acousticness duration_ms tempo 
## Pop Top Correlation Values: 0.04 0.03 0.03 
## 
## Pop Top 3 for attributes: danceability 
## Pop Top Correlated Attributes: valence energy instrumentalness 
## Pop Top Correlation Values: 0.57 0.18 0.11 
## 
## Pop Top 3 for attributes: energy 
## Pop Top Correlated Attributes: loudness valence tempo 
## Pop Top Correlation Values: 0.67 0.44 0.19 
## 
## Pop Top 3 for attributes: key 
## Pop Top Correlated Attributes: speechiness instrumentalness tempo 
## Pop Top Correlation Values: 0.11 0.05 0.05 
## 
## Pop Top 3 for attributes: loudness 
## Pop Top Correlated Attributes: energy valence danceability 
## Pop Top Correlation Values: 0.67 0.29 0.1 
## 
## Pop Top 3 for attributes: mode 
## Pop Top Correlated Attributes: acousticness duration_ms tempo 
## Pop Top Correlation Values: 0.09 0.05 0.03 
## 
## Pop Top 3 for attributes: speechiness 
## Pop Top Correlated Attributes: year tempo liveness 
## Pop Top Correlation Values: 0.2 0.17 0.15 
## 
## Pop Top 3 for attributes: acousticness 
## Pop Top Correlated Attributes: mode duration_ms popularity 
## Pop Top Correlation Values: 0.09 0.07 0.04 
## 
## Pop Top 3 for attributes: instrumentalness 
## Pop Top Correlated Attributes: energy danceability valence 
## Pop Top Correlation Values: 0.13 0.11 0.1 
## 
## Pop Top 3 for attributes: liveness 
## Pop Top Correlated Attributes: speechiness energy loudness 
## Pop Top Correlation Values: 0.15 0.14 0.1 
## 
## Pop Top 3 for attributes: valence 
## Pop Top Correlated Attributes: danceability energy loudness 
## Pop Top Correlation Values: 0.57 0.44 0.29 
## 
## Pop Top 3 for attributes: tempo 
## Pop Top Correlated Attributes: energy speechiness loudness 
## Pop Top Correlation Values: 0.19 0.17 0.08 
## 
## Pop Top 3 for attributes: explicitness 
## Pop Top Correlated Attributes: year speechiness danceability 
## Pop Top Correlation Values: 0.15 0.13 0.1

Pop Song Attributes Findings

Filtering out other genres strengthened some of our correlations. This is expected, as a genre is itself a way of grouping songs with similar attributes. However, since we are filtering out songs that fall outside the group, we can see that we lose some correlated attributes.

Loudness vs Energy - Improved to +0.67

Acousticness vs Energy - no correlation found between two attributes

Danceability vs Valence - Improved to +0.57

Explicitness vs Speechiness - Decreased to +0.13

Distribution of Song Lengths in the Spotify Top Hits

Distribution Findings

The chart above indicates that song lengths in the Spotify Top Hits have a mean of 3.812 mins with a standard deviation of 0.652 mins. Three standard deviations on either side of the mean span roughly 1.9 to 5.8 mins. If you are planning to compose a song in the near future, it is fair to say that a song longer than about 5.5 mins might not be a hit. Likewise, a song shorter than 2 mins might put you in a similar situation.
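The three-standard-deviation range can be checked directly from the reported mean and standard deviation. The variable names below are illustrative, and `ms_to_min` is a hypothetical helper for converting the `duration_ms` column to minutes.

```r
mean_min <- 3.812  # reported mean song length, in minutes
sd_min   <- 0.652  # reported standard deviation, in minutes

# Lower and upper bounds at three standard deviations from the mean
band <- mean_min + c(-3, 3) * sd_min
round(band, 2)  # 1.86 5.77

# Hypothetical helper: milliseconds to minutes
ms_to_min <- function(ms) ms / 60000
ms_to_min(228720)  # the mean length expressed in ms, back to minutes
```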

Central Limit Theorem

Drawing Samples of Songs and seeing on average what are the lengths of the song

## Sample Size =  10  Mean =  3.817295  SD =  0.2028002
## Sample Size =  20  Mean =  3.815265  SD =  0.1518441
## Sample Size =  30  Mean =  3.806269  SD =  0.1153173

## Sample Size =  40  Mean =  3.810629  SD =  0.1044319

Central Limit Theorem Findings

“The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of the normal distribution. The theorem is shown with various distributions of the input data in the following sections.”

It is pretty clear that as the sample size grows, the distribution of the sample means looks more normal, and its standard deviation gets closer and closer to 0, shrinking roughly in proportion to 1/sqrt(n). This data set behaves as expected: the standard deviation falls from about 0.203 at a sample size of 10 to about 0.104 at a sample size of 40.
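The shrinking spread of sample means can be reproduced with a short simulation. This is a sketch under stated assumptions: the population is simulated from a normal distribution using the reported mean and standard deviation rather than the actual song lengths, and `sample_mean_sd` and the 1000 draws per sample size are illustrative choices.

```r
set.seed(42)

# Simulated stand-in for the 2000 song lengths, in minutes
population <- rnorm(2000, mean = 3.812, sd = 0.652)

# Hypothetical helper: SD of the means of `draws` random samples of size n
sample_mean_sd <- function(n, draws = 1000) {
  means <- replicate(draws, mean(sample(population, n)))
  sd(means)
}

sizes    <- c(10, 20, 30, 40)
observed <- sapply(sizes, sample_mean_sd)

# Theoretical standard error: population SD divided by sqrt(n)
expected <- sd(population) / sqrt(sizes)

round(rbind(observed, expected), 3)
```

The observed values should track the theoretical ones closely, which is the same pattern the four reported standard deviations follow.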

Sampling of top 5 song genres via Simple Random Sample Without Replacement, Systematic Sampling, and Stratified Sampling

Findings

For sampling, we wanted to see whether, among the top 5 genres of Spotify Top Hits, different sampling methods give different results.

Simple random sampling selects a sample of a specified size from the larger group, or frame. In our case, each song has an equal chance of being selected, with a sample size of 100 across all samples. Out of the 1348 songs in total, 100 are randomly selected without replacement.

We also looked at systematic sampling, where items are picked according to a fixed rule. Selection bias may occur as a result of systematic sampling if there is a pattern in the input frame. For a sample size of 100, the data is divided into 27 groups, and every 27th item is taken. With systematic sampling, we may see some fluctuation in the data. In the pop, Dance/Electronic category, we may see an increase in the number of songs selected. We may also see an increase in the hip hop, pop category.
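Both selection schemes can be sketched in a few lines of base R. The frame size of 1348 and the sample size of 100 come from the text; everything else is an assumption for illustration, in particular the interval `k`, which is taken here as `floor(N / n)` and may differ from the grouping used in the report.

```r
set.seed(7)
N <- 1348  # songs in the top 5 genres
n <- 100   # sample size

# Simple random sample without replacement: every row is equally likely
srs_idx <- sample(N, n)

# Systematic sample: random start within the first interval,
# then every k-th row after that (k is an assumed choice)
k       <- floor(N / n)
start   <- sample(k, 1)
sys_idx <- seq(start, by = k, length.out = n)

head(srs_idx)
head(sys_idx)
```

Because the systematic indices are evenly spaced, any ordering pattern in the frame (for example rows sorted by genre or year) carries straight into the sample, which is the source of the selection bias mentioned above.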

Lastly, we looked at stratified sampling, in which the larger group of data is broken into smaller groups and then a set number of items is picked from each group. In this analysis, we looked at the top 5 genres, but with a sample size of 50.
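A stratified draw can be sketched as follows. The data frame `toy`, its three genres, and their sizes are made up, and proportional allocation (each genre contributes in proportion to its share of the frame) is an assumption; the report does not state how the per-genre sizes were chosen.

```r
set.seed(7)

# Made-up frame with three strata of unequal size
toy <- data.frame(
  id    = seq_len(1000),
  genre = rep(c("pop", "hip hop", "rock"), times = c(600, 300, 100))
)
n <- 50  # total sample size

# Proportional allocation: rows to draw from each genre
alloc <- round(n * table(toy$genre) / nrow(toy))

# Draw a simple random sample inside each stratum, then stack the pieces
strat <- do.call(rbind, lapply(names(alloc), function(g) {
  rows <- toy[toy$genre == g, ]
  rows[sample(nrow(rows), alloc[[g]]), ]
}))

table(strat$genre)
```

Unlike the simple random sample, this guarantees that every genre appears in the sample in a fixed proportion.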

Conclusion

Throughout the analysis, it is important to understand that this data set is not indicative of Top Hits from 1998 to 2020. This is a data set from Spotify using Spotify’s API, and it should not lead to conclusions outside of the application. It is clear, however, that popular songs typically fall in the Pop category and that there are certain attributes associated with the pop category. Certain genres like Jazz, Classical, and Blues have trouble getting into the top hits, but that could be attributed to how popularity is calculated. The Spotify listener population tends to prefer upbeat and uplifting songs, so it makes sense that these genres may be underrepresented. A thorough look through the Song Attributes by year shows a trend of the energy metric increasing significantly from 1998 to 2020. This is further supported by the danceability metric also steadily increasing in the same time period.